Audio Programming |
Introduction |
Digital audio is the most commonly used method to represent sound inside a computer. In this method sound is stored as a sequence of samples taken from the audio signal at constant time intervals. A sample represents the volume of the signal at the moment it was measured. In uncompressed digital audio each sample requires one or more bytes of storage. The number of bytes required depends on the number of channels (mono, stereo) and the sample format (8 or 16 bits, mu-Law, etc.). The length of the sampling interval determines the sampling rate. Commonly used sampling rates range from 8 kHz (telephone quality) to 48 kHz (DAT tapes).
The physical devices used in digital audio are the ADC (Analog to Digital Converter) and the DAC (Digital to Analog Converter). A device containing both an ADC and a DAC is commonly known as a codec. The codec device used in Sound Blaster cards is called a DSP, which is somewhat misleading since DSP also stands for Digital Signal Processor (the SB DSP chip is very limited when compared to "true" DSP chips).

Sampling parameters affect the quality of the sound that can be reproduced from the recorded signal. The most fundamental parameter is the sampling rate, which limits the highest frequency that can be stored. It is well known (Nyquist's Sampling Theorem) that the highest frequency that can be stored in a sampled signal is at most 1/2 of the sampling frequency. For example, an 8 kHz sampling rate permits recording of a signal in which the highest frequency is less than 4 kHz. Higher frequency components must be filtered out before the signal is fed to the ADC.

The sample encoding limits the dynamic range of the recorded signal (the difference between the faintest and the loudest signal that can be recorded). In theory the maximum dynamic range is number_of_bits * 6 dB, so 8 bit sampling resolution gives a dynamic range of 48 dB while 16 bit resolution gives 96 dB.

Quality has its price. The number of bytes required to store an audio sequence depends on the sampling rate, the number of channels and the sampling resolution. For example, just 8000 bytes of memory are required to store one second of sound using 8 kHz/8 bit/mono, while one second of 48 kHz/16 bit/stereo takes 192 kilobytes. A 64 kbps ISDN channel is enough to transfer an 8 kHz/8 bit/mono audio stream, while about 1.5 Mbps is required for DAT quality (48 kHz/16 bit/stereo). Put another way, one megabyte of memory holds only 5.46 seconds of sound when using 48 kHz/16 bit/stereo sampling, but 131 seconds when using 8 kHz/8 bit/mono. It is possible to reduce memory and communication costs by compressing the recorded signal, but that is outside the scope of this document.

OSS has three kinds of device files for audio programming. The only difference between these device files is the default sample encoding used after opening the device: /dev/dsp uses 8 bit unsigned encoding, /dev/dspW uses 16 bit signed little endian (Intel) encoding and /dev/audio uses logarithmic mu-Law encoding. There are no other differences between the devices; all of them work in 8 kHz mono mode after being opened. The sample encoding can be changed through the ioctl interface, after which all of these device files behave in the same way. However, it is recommended that the device file be selected based on the encoding to be used, since this gives the user more freedom in setting up symbolic links for these devices.

In short, it is possible to record from and play back to these devices using the normal open(), close(), read() and write() system calls. The default parameters of the device files (see above) have been selected so that speech and other signals with relatively low quality requirements can be recorded and played back directly. Many parameters of the devices can be changed by calling the ioctl() functions defined below.

All codec devices are able to record or play back audio, but there are devices which have no recording capability at all. Most audio devices work in half duplex mode, which means that they can record and play back but not at the same time. Devices capable of simultaneous recording and playback are called full duplex devices.

The simplest way to record audio data is to use normal UNIX commands such as cat or dd. For example, cat /dev/dsp > xyz records data from the audio device to a disk file called xyz until the command is killed (ctrl-C). The command cat xyz > /dev/dsp can be used to play the recorded sound file back. (Note that you may need to change the recording source and level with a mixer program before recording to disk works properly.)

Audio devices are always opened exclusively. If another program tries to open the device when it is already open, the driver immediately returns an error (EBUSY). |
||
General programming guidelines |
It is highly recommended that you carefully read the following notes and also the programming guidelines chapter of the introduction page. These notes are likely to prevent you from making the most common mistakes with the OSS API. At the very least you should read them if you have problems getting your program to work.
The following is a list of things that must be taken into account before starting to program digital audio. The features referred to in these notes will be explained in detail later in this document.
|
||
Simple audio |
For simplicity, recording and playback are described separately. It is possible to write programs which both record and play back audio data, but writing that kind of application is not simple. Such programs are covered in later sections.
|
||
Declarations for an audio program |
In general, all programs using the OSS API should include soundcard.h, a C language header file containing the definitions for the API. The other header files to be included are ioctl.h, unistd.h and fcntl.h. The other mandatory declarations for an audio application are a file descriptor for the device file and a program buffer which is used to store the audio data while the program processes it.
The following is an example of declarations for a simple audio program:
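A minimal sketch of such declarations might look like the following (the names audio_fd, BUF_SIZE and audio_buffer are only illustrative, and the exact location of soundcard.h may vary between systems):

    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <sys/soundcard.h>

    #define BUF_SIZE 4096                    /* size of the program buffer in bytes */

    int audio_fd;                            /* file descriptor of the audio device */
    unsigned char audio_buffer[BUF_SIZE];    /* buffer used while processing audio data */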
|
||
Selecting and opening the device |
An audio device must be opened before it can be used (obvious). As mentioned earlier, there are three possible device files which differ only in the default sample encoding they use (/dev/dsp = 8 bit unsigned, /dev/dspW = 16 bit signed little endian and /dev/audio = mu-Law). It is important to open the right device if the program doesn't set the encoding explicitly.

The device files mentioned above are actually just symbolic links to the real device files. For example, /dev/dsp normally points to /dev/dsp0, which is the first audio device detected on the system. The user is free to point the symbolic links at other devices if that gives better results. It is good practice to always use the symbolic link (/dev/dsp) and not the actual device (/dev/dsp0). Programs should access the actual device files only if the device name is made easily configurable.

It is recommended that the device file be opened in read only (O_RDONLY) or write only (O_WRONLY) mode. Read/write mode (O_RDWR) should be used only when it is necessary to record and play back at the same time (duplex mode).

The following code fragment can be used to open the selected device (DEVICE_NAME). open_mode should be O_WRONLY, O_RDONLY or O_RDWR. Other flags are undefined and must not be used with audio devices.
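A sketch of such an open call, reusing the illustrative audio_fd declared earlier and an assumed DEVICE_NAME definition (stdio.h and stdlib.h are assumed for perror() and exit()):

    #define DEVICE_NAME "/dev/dsp"           /* illustrative; select the device by encoding */

    if ((audio_fd = open(DEVICE_NAME, open_mode, 0)) == -1) {
        /* The open failed; the device may be missing or already in use (EBUSY). */
        perror(DEVICE_NAME);
        exit(1);
    }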
|
||
Simple recording application |
Writing an application which reads from an audio device is very easy as long as the recording speed is relatively low, the program doesn't perform time consuming computations and there are no strict real time response requirements. Solutions for the more demanding cases are presented later in this document. All the program needs to do is to read data from the device and to process or store it in some way. The following code fragment can be used to read data from the device:
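A sketch of such a read, using the illustrative names declared above:

    int len;

    /* read() blocks until sizeof(audio_buffer) bytes have been recorded. */
    if ((len = read(audio_fd, audio_buffer, sizeof(audio_buffer))) == -1) {
        perror("audio read");
        exit(1);
    }
    /* len now holds the number of bytes actually recorded. */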
The number of bytes recorded from the device can be used to measure time precisely. The audio data rate (bytes per second) depends on the sampling speed, the sample size and the number of channels. For example, when using 8 kHz/16 bit/stereo sampling the data rate is 8000*2*2 = 32000 bytes/second. This is actually the only way to know when to stop recording, since there is no end of file condition defined for audio devices. An error returned by read() usually means that there is a (permanent) hardware error or that the program has tried to do something which is not possible. It is generally not possible to recover from errors by trying again, although closing and reopening the device may help in some cases. |
||
Simple playback application |
A simple playback program works exactly like a recording program. The
only difference is that a playback program calls write().
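For example, a buffer recorded as above could be played back with a fragment like this (same illustrative names as before, and the device must have been opened for writing):

    /* write() blocks until the whole buffer has been queued for playback. */
    if (write(audio_fd, audio_buffer, len) == -1) {
        perror("audio write");
        exit(1);
    }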
|
||
Setting sampling parameters |
There are three parameters which affect the quality (and the memory/bandwidth requirements) of sampled audio data: the sample format (number of bits per sample), the number of channels (mono or stereo) and the sampling rate (speed).
|
||
Selecting audio format |
The sample format is an important parameter which affects the quality of audio data. The OSS API supports several different sample formats, but most devices support just a few of them. soundcard.h defines identifiers for the sample formats; those referred to in this document include AFMT_U8, AFMT_S16_LE, AFMT_S16_BE, AFMT_S16_NE and AFMT_MU_LAW.
Applications should check that the sample format they require is supported by the device. Unsupported formats should be handled by converting the data to another format (usually AFMT_U8); alternatively the program should abort if it cannot do the conversion. Trying to play data in an unsupported format is a fatal error. The result is usually just LOUD noise which may damage ears, headphones, speakers, amplifiers, concrete walls and other unprotected objects.

The format identifiers have been selected so that AFMT_U8 = 8 and AFMT_S16_LE = 16. This makes these identifiers compatible with older ioctl() calls which were used to select the number of bits. This holds only for these two formats, so format identifiers should not be used as sample sizes in programs. AFMT_S16_NE is a macro provided for convenience; it is defined as AFMT_S16_LE or AFMT_S16_BE depending on the endianness of the processor the program is running on. The number of bits required to store a sample is 8 for the 8 bit formats (including mu-Law) and 16 for the 16 bit formats.
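The sample format is selected with the SNDCTL_DSP_SETFMT ioctl; a sketch (the choice of AFMT_S16_LE and the error handling are illustrative):

    int format = AFMT_S16_LE;                /* the format we want to use */

    if (ioctl(audio_fd, SNDCTL_DSP_SETFMT, &format) == -1) {
        perror("SNDCTL_DSP_SETFMT");
        exit(1);
    }
    if (format != AFMT_S16_LE) {
        /* The device doesn't support the requested format; format now holds
           the format selected by the driver. Convert the data or abort here. */
    }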
It is very important to check that the value returned in the argument after the ioctl call matches the requested format. If the device doesn't support the requested format, the driver selects another format which is supported by the hardware and returns it in the argument. A program can check which formats are supported by the device by calling the SNDCTL_DSP_GETFMTS ioctl as in the following:
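A sketch of such a query; the returned value is a bit mask which can be tested against the format identifiers:

    int mask;

    if (ioctl(audio_fd, SNDCTL_DSP_GETFMTS, &mask) == -1) {
        perror("SNDCTL_DSP_GETFMTS");
        exit(1);
    }
    if (mask & AFMT_MU_LAW) {
        /* The hardware supports mu-Law directly. */
    }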
SNDCTL_DSP_GETFMTS returns only the sample formats that are actually supported by the hardware. The driver may support additional formats using software conversions (signed <-> unsigned, big endian <-> little endian or 8 bits <-> 16 bits). These emulated formats are not reported by this ioctl(), but SNDCTL_DSP_SETFMT accepts them. The software conversions consume a significant amount of CPU time, so they should be avoided; use this feature only if it is not possible to modify the application to produce a supported data format directly.

AFMT_MU_LAW is a data format which is accepted by all devices. OSS versions before 3.6 always reported this format in SNDCTL_DSP_GETFMTS. Versions 3.6 and later report it only if the device supports mu-Law in hardware. This encoding should be used only with applications and audio files ported from systems using mu-Law encoding (such as SunOS). |
||
Selecting number of channels (mono/stereo) |
Most modern audio devices support stereo mode (the default mode is mono). An application can select stereo mode by calling the SNDCTL_DSP_STEREO ioctl as shown below. It is important to notice that only the values 0 (mono) and 1 (stereo) are allowed; the result of using any other value is undefined.
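A sketch of selecting stereo mode (the error handling is illustrative):

    int stereo = 1;                          /* 1 = stereo, 0 = mono */

    if (ioctl(audio_fd, SNDCTL_DSP_STEREO, &stereo) == -1) {
        perror("SNDCTL_DSP_STEREO");
        exit(1);
    }
    if (stereo != 1) {
        /* The device supports only mono; the data must be handled accordingly. */
    }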
|
||
Selecting sampling rate (speed) |
The sampling rate is the parameter that determines much of the quality of an audio recording. The OSS API permits selecting any frequency between 1 Hz and 2 GHz, but in practice there are limits set by the audio device being used. The minimum frequency is usually 5 kHz while the maximum frequency varies widely. The oldest sound cards supported at most 22.05 kHz (playback) or 11.025 kHz (recording). The next generation supported 44.1 kHz (mono) or 22.05 kHz (stereo). With modern sound devices the limit is 48 kHz (DAT quality), but there are still a few popular cards that support just 44.1 kHz (audio CD quality).

The default sampling rate is 8 kHz. However, an application should not depend on the default, since there are devices that support only higher sampling rates; with such devices the default rate could be as high as 48 kHz.

Codec devices usually generate the sampling clock by dividing the frequency of a high speed crystal oscillator, so it is not possible to generate every frequency in the valid range. For this reason the driver computes the valid frequency which is closest to the requested one and returns it to the calling program. The application should check the returned frequency and compare it with the requested one. Differences of a few percent should be ignored since they are usually not audible.

The following code fragment can be used to select the sampling speed:
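A sketch of selecting the sampling rate with SNDCTL_DSP_SPEED (the requested rate of 11025 Hz is only an example):

    int speed = 11025;                       /* requested sampling rate in Hz */

    if (ioctl(audio_fd, SNDCTL_DSP_SPEED, &speed) == -1) {
        perror("SNDCTL_DSP_SPEED");
        exit(1);
    }
    /* speed now holds the rate actually granted by the driver; differences of
       a few percent from the requested value can normally be ignored. */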
|
||
Other commonly used ioctl calls |
It is possible to implement most audio processing programs without using any ioctl calls other than the three described above. This is the case if the application just opens the device, sets the parameters, calls read or write continuously (without noticeable delays or pauses) and finally closes the device. This kind of application can be described as a "stream" or "batch" application.
There are three additional calls which may be required with slightly more complicated programs. These new calls are the following:
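In the OSS API the additional calls in question are most likely SNDCTL_DSP_SYNC, SNDCTL_DSP_RESET and SNDCTL_DSP_POST; a sketch of how they are typically invoked:

    /* SNDCTL_DSP_SYNC waits until all data queued for playback has been
       played; use it before closing or reconfiguring the device. */
    ioctl(audio_fd, SNDCTL_DSP_SYNC, 0);

    /* SNDCTL_DSP_RESET stops recording/playback immediately and discards
       any buffered data. */
    ioctl(audio_fd, SNDCTL_DSP_RESET, 0);

    /* SNDCTL_DSP_POST tells the driver that a pause in the output is likely,
       so it can handle the already buffered data accordingly. */
    ioctl(audio_fd, SNDCTL_DSP_POST, 0);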
There are a few places where these calls should be used:
|
||
Interpreting audio data |
Encoding of audio data depends on the sample
format. There are several possible formats and only the most common
ones are described here.
Mu-Law is a format that originated in digital telephone technology. Each sample is represented as an 8 bit value which is compressed from the original 16 bit value. Due to the logarithmic encoding, the value must be converted to linear format before it is used in computations (summing two mu-Law encoded values gives nothing useful). The actual conversion procedure is beyond the scope of this text. Avoid mu-Law if possible and use the 8 or 16 bit linear formats instead.

8 bit unsigned is the normal PC sound card ("Sound Blaster") format which is supported by practically any hardware. Each sample is stored in an 8 bit byte. The value 0 represents the minimum level and 255 the maximum, with the neutral level at 128 (0x80). However, in most cases there is some noise in recorded "silent" files, so the byte values may vary between 127 (0x7f) and 129 (0x81). The C data type to be used is unsigned char. To convert between unsigned and signed 8 bit formats, add or subtract 128 from the value to be converted (depending on the direction); in practice, XORing the value with 0x80 does the same (value ^= 0x80).

Caution! Great care must be taken when working with the 16 bit formats. 16 bit data is not portable and depends on the design of both the CPU and the audio device. The situation is simple when using a (little endian) x86 CPU with a "normal" sound card: both the CPU and the sound card use the same encoding for 16 bit data. The same is not true when using 16 bit encoding in a big endian environment such as Sparc, PowerPC or HP-PA. The 16 bit encoding normally used by sound hardware is little endian (AFMT_S16_LE), but there are machines with built in audio chips which support only big endian encoding. When using signed 16 bit data, the C data type best matching this encoding is usually signed short. However, this is true only on little endian machines, and the C standard doesn't define the sizes of particular data types, so there is no guarantee that short will be 16 bits long on all machines. For this reason using an array of signed short as an audio buffer should be considered a programming error, although it is commonly done in audio applications. The proper way is to use an array of unsigned char and to assemble/disassemble the buffer passed to the driver manually. For example:
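A sketch of the kind of manual assembly meant here, writing signed 16 bit samples into an unsigned char buffer in little endian (AFMT_S16_LE) byte order; the buffer sizes and the application_data array are illustrative:

    unsigned char devbuf[4096];              /* buffer passed to the driver */
    int application_data[2048];              /* samples in the range -32768..32767 */
    int i, p = 0;

    /* Store each sample with the least significant byte first, regardless of
       the native byte order of the CPU. */
    for (i = 0; i < 2048; i++) {
        devbuf[p++] = (unsigned char)(application_data[i] & 0xff);
        devbuf[p++] = (unsigned char)((application_data[i] >> 8) & 0xff);
    }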
|
||
Encoding stereo data |
When using stereo data, there are two samples for each time slot. The left channel data is always stored before the right channel data. The samples for both channels are encoded as described above.
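For example, with the 8 bit unsigned format an interleaved stereo buffer could be filled like this (the left, right and stereobuf arrays and their sizes are illustrative):

    unsigned char left[1024], right[1024];   /* per-channel samples (example sizes) */
    unsigned char stereobuf[2 * 1024];       /* interleaved stereo output buffer */
    int i;

    for (i = 0; i < 1024; i++) {
        stereobuf[2 * i]     = left[i];      /* left channel sample comes first   */
        stereobuf[2 * i + 1] = right[i];     /* right channel sample comes second */
    }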
The representation of samples for more than 2 channels will be defined in the future. |
||
Conclusion |
The above is all you need to know to implement "basic" audio applications. There are many other ioctl calls, but they are usually not required. However, there are "real time" audio applications such as games, voice conferencing systems, sound analysis tools, effect processors and many others, and the above is not enough when implementing that kind of program. More information about other audio programming features can be found in the Making audio complicated section. Be sure you understand everything described above before jumping into that page. |